ABSTRACT


Several concepts of information can be defined on algebras of
events, and can be related to probabilities. Two very useful concepts
are entropy and discrimination information, with applications to
communication theory and statistical inference, respectively. Conditional
information can also be defined, given an event, or sub-algebra of events.


A historical summary of information theory is given in Chapter 


II, which includes several generalizations of the information concept. 


The properties of conditional information are employed in 


reaching a general result concerning the information in a Markov process. 
INTRODUCTION


In this dissertation, we will be examining concepts of
information and probability, and some applications. Working from an intuitive
basis, we will consider how the two concepts are related, and will look at
how concepts of information can be utilized in communication theory,
statistical inference, and Markov processes.


In chapter I, we introduce measures of probability and
information on an algebra of events. Some properties of functions defined on a
family of probability algebras are examined. Two such functions are
entropy and discrimination information, and their characteristics are listed.
The content of this chapter is a reformulation and unification of diverse
results.


In chapter II, various ideas concerning information are collected
together from several sources. Applications in communication theory and
statistical inference are considered.


In the final chapter, we are concerned with the information for 
discriminating between two Markov processes. After some preliminaries, a 
general result is proved for processes of the jump kind. We define a 
concept of infinitesimal information, and relate it to the discrimination 
information contained in an interval of time via an operator equation. The
content (except for some expository preliminaries) is original.


Two theorems, which could not be found in the literature in the 


desired form, but which digress from the main path of the dissertation, are 


included as appendices. The proofs are our own. 
CHAPTER I


THE CONCEPT OF INFORMATION 


1. Probability and Information.


Information is concerned with knowledge, and knowledge is derived
from observations. In real life, all our observations are limited in
precision, and our actions are finite in number. Mathematics has invented real
numbers, which greatly simplify analysis, but no empirical scientist has
ever observed "a real number". Rather he observes something in an interval
whose size is determined by the precision of his instruments. But we can
consider the real numbers as being "limits of precision", in a sense defined
so as to "complete" the number system, either via Dedekind cuts or Cauchy
sequences.


Similarly, few Borel sets can be "measured". Rather we can only 
measure unions of a finite number of intervals. The "events" of classical 
probability theory are again limits (in some sense) of empirical events, 
invented for both aesthetic and utilitarian reasons. In the limiting 
process we give birth to the orphans called "sets of measure zero", which 
are both possible and impossible. But these sets are not
observable, so it is a probabilistic enigma to call them events.
Rather they are reminders of the fact that we put no upper limit on our
possible precision. We can accomplish the same by talking about a measure
algebra without atoms.


We will adopt the philosophy that any concept is the limit, in
some sense, of that concept defined on finite systems. As much as possible
we will consider an algebra of events, rather than a "point set", as
primitive. This "algebraic" approach to measure theory can be traced back to
Caratheodory [4]. It has been applied to probability theory by Halmos [19],
Birkhoff [2] and Kappos [25,26]. A fuller bibliography can be found in
Kappos [26]. Information is defined on events and algebras of events, so
that the "points" would for the most part be unnecessary baggage. However,
there are times (particularly in the final chapter) where we would like to
make use of standard results in analysis, and then we will represent our
measure algebra in $(\Omega, \mathcal{A}, \mu)$ style.


1.1 There are several interpretations of the concept "probability" [3],
but for the moment we will adopt the subjectivist one that probability is
a measure of belief or expectation. Mathematically, we can consider a
class of events, or observations; the "more likely" observations having a
higher probability than the "less likely" ones. For any two events, we
can conceive of their logical conjunction and logical disjunction; and if
we adopt the convention of a sure event $I$ and an impossible event $O$,
then our class has the structure of a Boolean algebra.


Let us call our class of events $E$. Our probability can be
expressed as a function $p : E \to [0,1]$ such that $p(O) = 0$, $p(I) = 1$,
$p(A) \in (0,1)$ for every other $A$, and $A \wedge B = O \Rightarrow p(A \vee B) = p(A) + p(B)$, and hence the
couple $(E,p)$ defines a measure algebra. Denoting by "+" the symmetric
difference on $E$, we find that $\rho(A,B) = p(A+B)$ is a metric, and to make
things neater we usually talk about $\bar{E}$, the completion of $E$. $p$ can be
extended to $\bar{p}$ on $\bar{E}$ by continuity, and it follows that $\bar{p}$ is continuous
in both the metric and the stronger, order topology [26, chapter 2].
Note that we have required the probability to be strictly
positive, i.e., there is only one event of measure zero, the impossible event.
(Such a convention is analogous to the situation in set theory, where
there is only one "empty" set.) In relating this approach to the classical
theory, it must be borne in mind that our events are modulo the $\sigma$-ideal
of null sets, i.e., we identify sets which differ only by a set of measure
zero.


A "probability" which is not strictly positive we will call 


improper. A "probability" for which p(I) < 1 we will call defective. 


The completed algebra $\bar{E}$ is a $\sigma$-algebra, and the probability
$\bar{p}$ is $\sigma$-additive. All probability algebras in this dissertation will be
assumed to be complete in the above sense, and hence $\sigma$-algebras.


If the elements of $E$ are thought of as propositions, then $I$
represents a tautology, $O$ a contradiction, and the other elements are
contingent propositions. Elements which are "smaller" (in the lattice sense)
represent more specific propositions, and people are often interested in
making their statements more specific.


1.2 Probability is, of course, a monotone function on $E$, and it is
reasonable to expect any function on $E$ measuring information to be
monotone decreasing, as one would think that more specific statements
convey more information. We can draw a closer link between the concepts of
probability and information if we make a vague appeal to psychology, and
maintain that the occurrence of an unexpected event (or equivalently, the
confirmation of a dubious proposition) appears to convey more "information"
than the occurrence of an event we were expecting.


One can thus postulate that a measure of information should be a 
decreasing function of the probability. It is reasonable to expect that 
the information provided by independent events should be additive, hence a 


reasonable measure of the information of an event would be 


$J(A) = -\log p(A)$ .    (1.2.1)
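The additivity that motivates (1.2.1) can be checked numerically. The following is a minimal sketch; the function name `information` and the two probabilities are illustrative assumptions, and natural logarithms are used.

```python
import math

def information(p: float) -> float:
    """Information J(A) = -log p(A) of an event with probability p, in nats."""
    if p == 0.0:
        return math.inf          # the impossible event
    return -math.log(p)

p_a, p_b = 0.5, 0.125            # illustrative probabilities (assumed values)
j_joint = information(p_a * p_b) # independent events: p(A and B) = p(A) p(B)
j_sum = information(p_a) + information(p_b)
print(j_joint, j_sum)
```

The two printed values agree up to rounding, and a less probable event always receives the larger value of J.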


1.3 One can deduce (1.2.1) axiomatically [24]: Let $J$ be a non-negative,
decreasing function on $E$ such that $J(O) = +\infty$, $J(I) = 0$. Let
us also postulate the existence of a finite or countable family of
subalgebras $\{F_\alpha\}$ which is such that for any family of events $\{A_\alpha\}$ such that
$A_\alpha \in F_\alpha$, $A_\alpha \neq O$,

$\bigwedge_\alpha A_\alpha \neq O$ .    (1.3.1)

Such a family is called M-independent. Intuitively, this means that
elements of different $F_\alpha$ are mutually compatible, so that presumably
information about the one gives no information about the others.


Suppose now that J satisfies the following axioms: 


For $A_\alpha \in F_\alpha$, $J(\bigwedge_\alpha A_\alpha) = \sum_\alpha J(A_\alpha)$ .    (1.3.2)


There is an operation $T$ on the positive extended real axis,
such that for any $A, B \in E$ such that $A \wedge B = O$

$J(A \vee B) = J(A) \, T \, J(B)$ .    (1.3.3)
J(A,) + [J(A,AA,) T J(A, AAS) Z J(ALTA,) 


STNG J(AAAL) 1 1 [J(A,)+I(A,AA,) ] rT [J(A,)+I (AAA, )] , tare 


Axiom (1.3.4) is somewhat technical and represents a limited distributive 


law. 


Not all operations $T$ make the axioms consistent. If we also
require that we can assign information values within any sub-algebra $F_\alpha$
with no regard to the values in the others, then $T$ can only take on two
forms.


$x \, T \, y = \inf(x,y)$    (1.3.5)

$x \, T \, y = -c \log\left(e^{-x/c} + e^{-y/c}\right)$ .    (1.3.6)

If T is of the form (1.3.5), then there exists no probability 

measure p on E such that J can be represented as a strictly decreas- 
ing left-continuous function of p. If T is of the form (1.3.6), then 


there exists a probability measure p on E such that 


$J = -c \log p$ .    (1.3.7)
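To see why the second form of $T$ goes together with a probability, one can check that it merely restates the additivity of $p$ on disjoint events. This is a small sketch under assumed values; the helper name `compose`, the constant `c = 1` and the two probabilities are ours.

```python
import math

def compose(x: float, y: float, c: float = 1.0) -> float:
    """The operation x T y = -c log(exp(-x/c) + exp(-y/c)) of (1.3.6)."""
    return -c * math.log(math.exp(-x / c) + math.exp(-y / c))

c = 1.0
p_a, p_b = 0.2, 0.3                  # disjoint events, assumed probabilities
j_a, j_b = -c * math.log(p_a), -c * math.log(p_b)
j_union = -c * math.log(p_a + p_b)   # J(A v B) computed from p directly
print(compose(j_a, j_b, c), j_union)
```

Both numbers coincide up to rounding, and the composed value never exceeds the smaller of its two arguments, in keeping with $J$ being decreasing.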


1.4 Equation (1.2.1) defines the information gained by the confirmation
of an uncertain event. We can generalize this formula if we say that after
our experiment, the probability of $A$ is now $q(A)$. (This generalization
makes sense in either the subjectivist or logical interpretations of
probability, but it has no frequentist interpretation.) Then we can say that
the information transmitted by this change of probability is

$J(A) = \log q(A) - \log p(A)$ .    (1.4.1)


This quantity is properly described as information in favour of $A$, for
if the complement of $A$ is confirmed, and hence $q(A) = 0$, we have $J(A) = -\infty$; but we
have not lost information about $A$, we have merely lost information in
favour of $A$.


In (1.4.1) q(A) is different from unity only if we do not 
directly observe A . We observe some other event B , and we know the 
stochastic link between A and B , and hence we can calculate q(A) ; 


i.e., if we observe $B$, then $q(A)$ is the conditional probability

$p(A|B) = p(A \wedge B)/p(B)$ .
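As an illustration of (1.4.1) with $q(A) = p(A|B)$, the sketch below uses a purely hypothetical joint distribution over the four conjunctions of $A$ and $B$ with their complements; none of the numbers come from the text.

```python
import math

# Assumed joint probabilities of the four conjunctions of A (or not A)
# with B (or not B); the numbers are purely illustrative.
joint = {("A", "B"): 0.30, ("A", "notB"): 0.10,
         ("notA", "B"): 0.10, ("notA", "notB"): 0.50}

p_a = joint[("A", "B")] + joint[("A", "notB")]      # p(A)
p_b = joint[("A", "B")] + joint[("notA", "B")]      # p(B)
q_a = joint[("A", "B")] / p_b                        # q(A) = p(A|B)

info_in_favour = math.log(q_a) - math.log(p_a)       # formula (1.4.1)
print(info_in_favour)
```

Here the observation of $B$ raises the probability of $A$, so the information in favour of $A$ is positive; had $B$ lowered it, the same formula would give a negative value.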


Formula (1.4.1) also has another interpretation (which can be
reduced to the former under a Bayesian model). Note that it is in the
form of a log likelihood ratio, and hence we can interpret it as the
information in favour of $q$ against $p$, provided by the observation $A$. The
connection between these two interpretations is formal more than conceptual,
but we will see a closer connection in the sequel. We will call the second
interpretation discrimination information [28,29].


2. Sub-algebras and Conditioning. 


So far we have been concerning ourselves with individual events. 


Let us now extend our horizons. 
will speak about a finite partition.


Every partition generates a sub-algebra consisting of the join of 
its atoms, and conversely, every atomic sub-algebra defines a partition. 
We will often use the terms partition and atomic sub-algebra ambiguously. 


Of course, every finite sub-algebra is automatically atomic. 


We can intuitively think of a partition as representing an
experiment to determine which of the atomic propositions $B$ in $\mathcal{B}$ is true.


2.2 We can define a (simple) random variable on a finite probability
algebra as a function on the set of its atoms. We can also define
elementary random variables as functions on the atoms of a countable partition.
Note that the Boolean algebra generated by a countably infinite partition
is, except in trivial cases, uncountable. We can define the usual
function space operations (sum, product, scalar multiplication, positive
and negative part) on the set of all elementary random variables, where if $f$
and $g$ are defined on $\mathcal{B}$ and $\mathcal{C}$ respectively, then $f \,\omega\, g$ is defined on
$\mathcal{B} \vee \mathcal{C}$ for any operation $\omega$.


With these definitions, the set $E(\mathcal{A})$ of all elementary random
variables on $\mathcal{A}$ becomes a vector lattice, and we obtain the set of all
random variables $V(\mathcal{A})$ as the order-completion of $E(\mathcal{A})$. It is well
known that a non-negative measurable function (in the classical sense) is
the monotone limit of simple functions [37, p. 224], so that Kappos'
approach is equivalent to the standard analytical approach.


Again, we can define expectation in the obvious way on $E(\mathcal{A})$
and extend it to $L(\mathcal{A}) \subset V(\mathcal{A})$ by continuity, following the Daniell
approach [26, chapter V].


2.3 A Boolean algebra is, of course, a Boolean ring. If $\mathcal{B}$ is a
sub-ring of $\mathcal{A}$, it may be a Boolean algebra per se, although it is not a
sub-algebra of $\mathcal{A}$. This is true in particular if $\mathcal{B}$ is finite, or if $\mathcal{B}$ is
a principal ideal. Principal ideals of a Boolean algebra are of the form
$\mathcal{B} = \{A \in \mathcal{A} : A \le B\}$ for some $B$. They will be denoted by $B\mathcal{A}$.


We will denote by $F(\mathcal{A})$ and $R(\mathcal{A})$ respectively the sets of all
sub-$\sigma$-algebras and sub-$\sigma$-rings which are algebras. They are both complete
lattices [2, p. 49] with greatest element $\mathcal{A}$, and least elements respect-
If $C$ is any class of Boolean $\sigma$-algebras which is a lattice
under the operations

$\mathcal{A} \wedge \mathcal{B} = \mathcal{A} \cap \mathcal{B}$

$\mathcal{A} \vee \mathcal{B}$ = smallest $\sigma$-algebra containing $\mathcal{A}$ and $\mathcal{B}$

we will call it a lattice of $\sigma$-algebras. It may not have a greatest, or a
least element.


We will say that $C$ is a hereditary class if $\mathcal{A} \in C \Rightarrow R(\mathcal{A}) \subset C$
We will require the following condition for any Boolean
$\sigma$-algebra $\mathcal{A}$ we are studying:

There exists an increasing sequence of finite sub-algebras $\mathcal{B}_n$
such that

$\bigvee_n \mathcal{B}_n = \mathcal{A}$ .    (2.3.1)

Such algebras will be called separable. It follows that for any probability
measure $p$ on $\mathcal{A}$, $\mathcal{A}$ is also separable in the metric $\rho(A,B) = p(A+B)$
[20].


2.4 If $O \neq B \in \mathcal{A}$, then the principal ideal $B\mathcal{A} = \{A \in \mathcal{A} : A \le B\}$ is
also a Boolean algebra, and a probability $p$ on $\mathcal{A}$ induces a probability
$p_B$ on $B\mathcal{A}$ defined by

$p_B(C) = p(C)/p(B)$ , $C \le B$ .    (2.4.1)

We can extend this measure as an (improper) probability on all of $\mathcal{A}$,

$p_B(A) = \dfrac{p(A \wedge B)}{p(B)}$ ,    (2.4.2)

the so-called conditional probability given $B$. We can make it a proper
probability by considering $\mathcal{A}$ modulo $B^c\mathcal{A}$, which is isomorphic to $B\mathcal{A}$.

In this way, any finite partition $\mathcal{B}$ generates a family of
probability algebras, one for each atom of the partition.
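A finite case may make the construction concrete. In the sketch below a finite probability algebra is modelled by its atoms, events by sets of atoms, and a partition by a list of disjoint events; the atom names and probabilities are assumptions chosen for illustration.

```python
from fractions import Fraction

# A finite probability algebra is modelled by its atoms; events are sets of atoms.
p = {"a1": Fraction(1, 2), "a2": Fraction(1, 4),
     "a3": Fraction(1, 8), "a4": Fraction(1, 8)}
partition = [frozenset({"a1", "a2"}), frozenset({"a3", "a4"})]

def prob(event):
    """Probability of an event, the sum over the atoms it contains."""
    return sum(p[atom] for atom in event)

def conditional(event, given):
    """Conditional probability of `event` given the partition atom `given`."""
    return prob(set(event) & set(given)) / prob(given)

event = {"a2", "a3"}
for block in partition:
    print(sorted(block), conditional(event, block))
```

Each atom of the partition carries its own probability, and weighting the conditional values by the probabilities of the atoms recovers the unconditional probability of the event.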
If $f$ is any function defined on a hereditary class of
probability algebras, then given an algebra $\mathcal{A}$ and a finite partition $\mathcal{B}$ in $\mathcal{A}$
we have a family of values

$f[\mathcal{A}|B] = f(B\mathcal{A})$    (2.4.3)

indexed by the atoms $B$ of $\mathcal{B}$. The above expression in fact defines a
simple ($\mathcal{B}$-measurable) random variable, which we can denote by $f_{\mathcal{B}}(\mathcal{A}|\cdot)$;
we will denote the expectation by

$f(\mathcal{A}|\mathcal{B}) = \sum_B p(B) \, f(B\mathcal{A})$ ,    (2.4.4)

the summation extending over all atoms $B$ of $\mathcal{B}$.


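If $\mathcal{A}$ and $\mathcal{B}$ are sub-algebras of an algebra $\mathcal{C}$, but we don't
have $\mathcal{B} \subset \mathcal{A}$, we can still define $f[\mathcal{A}|B]$ via (2.4.3) and hence $f(\mathcal{A}|\mathcal{B})$,
where by $B\mathcal{A}$ we still mean all events of the form $B \wedge A$, $A \in \mathcal{A}$, although
it is no longer an ideal of $\mathcal{A}$. It is easy to show that

$f(\mathcal{A}|\mathcal{B}) = f(\mathcal{A} \vee \mathcal{B}|\mathcal{B})$    (2.4.5)

and that

$f(\mathcal{A}|\mathcal{A}) = f(T)$ .

If $\mathcal{D}$ is any separable sub-algebra of $\mathcal{C}$, and if we have
finite $\mathcal{B}_n \uparrow \mathcal{D}$, then we have a sequence of random variables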
and if this sequence has a unique limit (in some sense) then we can talk of
the random variable

$f_{\mathcal{D}}(\mathcal{A}|\cdot)$    (2.4.6)

and its expectation $f(\mathcal{A}|\mathcal{D})$.

We will adopt the following terminology: The random variable
will be called $f$ conditioned by $\mathcal{D}$, its expectation will be called $f$
conditional on $\mathcal{D}$.


2.5 Conditional probability with respect to a non-atomic sub-algebra can
be defined as a Radon-Nikodym derivative.

For any $A$, $p_A(B) = p(A \wedge B)$ is a measure on $\mathcal{B}$, absolutely
continuous with respect to $p$ restricted to $\mathcal{B}$, hence the Radon-Nikodym
derivative $p(A|\mathcal{B})$ exists [26, p. 144].

Recall that we have only one impossible event, so that
essentially we are identifying functions equal almost everywhere. Hence the
derivative is unique.


Furthermore, if $A = A_1 \vee A_2$, $O = A_1 \wedge A_2$, then
Hence, because of uniqueness,

$p(A_1|\mathcal{B}) + p(A_2|\mathcal{B}) = p(A|\mathcal{B})$ .    (2.5.1)

Also, if $A_n \uparrow A$, then $p(A_n|\mathcal{B})$ is increasing, and by monotone
convergence, for any $B \in \mathcal{B}$

$E(1_B \sup_n p(A_n|\mathcal{B})) = \sup_n E(1_B \, p(A_n|\mathcal{B}))$

$= \sup_n p(A_n \wedge B)$

$= p(A \wedge B)$ .    (2.5.2)

Hence, again because of uniqueness,

$\sup_n p(A_n|\mathcal{B}) = p(A|\mathcal{B})$ .    (2.5.3)

We can thus consider conditional probability as a continuous vector-valued
measure.


Conditional expectation could be defined as expectation with 
respect to conditional probability (see [21] for integration with respect 


to vector measures), or directly in terms of Radon-Nikodym derivatives 
2.6 Two events $A$ and $B$ are said to be independent if $p(A \wedge B)$ =
Two algebras $\mathcal{A}$ and $\mathcal{B}$ are conditionally independent given a
third algebra $\mathcal{C}$, if for all $A \in \mathcal{A}$, $B \in \mathcal{B}$,

$p(A \wedge B|\mathcal{C}) = p(A|\mathcal{C}) \, p(B|\mathcal{C})$ .    (2.6.1)


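If $\mathcal{A}$ and $\mathcal{B}$ are independent, and finite, then the $p_B$ of
(2.4.2) are identical to $p$ for all $B$. It hence follows that for any
function $f$ of section 2.4,

$f(\mathcal{A}|\mathcal{B}) = f(\mathcal{A})$ .    (2.6.2)

If $\mathcal{C}$ is atomic, and $\mathcal{A}$ and $\mathcal{B}$ are conditionally independent given $\mathcal{C}$,
then

$f(\mathcal{A}|\mathcal{B} \vee \mathcal{C}) = f(\mathcal{A}|\mathcal{C})$ ,    (2.6.3)

for the atoms of $\mathcal{B} \vee \mathcal{C}$ are of the form $B \wedge C$ and $p_{B \wedge C} = p_C$ .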


3. Entropy and Discrimination Information. 


3.1 Let us suppose that we are in a situation where we are independently
receiving information from a large number of algebras isomorphic to $E$.
Equivalently, we could imagine that after having an event in $E$ confirmed,
the situation changes, and our uncertainty is restored. It is then
reasonable to ask about a measure of average or expected information. If $\mathcal{F}$ is
a sub-algebra generated by a finite partition, then the expected
information in $\mathcal{F}$ is naturally defined as
$H(\mathcal{F}) = -\sum_A p(A) \log p(A)$ ,    (3.1.1)

the summation extending over all atoms $A$ of $\mathcal{F}$. This definition also
makes sense in the case that $\mathcal{F}$ is generated by a countable partition, but
in this case $H(\mathcal{F})$ may be infinite. The quantity $H$ is usually called
entropy.
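Definition (3.1.1) is easily computed for a finite partition. The sketch below takes the list of atom probabilities as input; the function name `entropy` and the example distributions are our own, and natural logarithms are assumed.

```python
import math

def entropy(probs):
    """H = -sum p log p over the atoms, with 0 log 0 taken as 0 (nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

print(entropy([0.25] * 4))            # uniform on four atoms
print(entropy([0.7, 0.1, 0.1, 0.1]))  # a less balanced partition
print(entropy([1.0]))                 # trivial algebra
```

The uniform partition gives the largest of the three values, and the trivial algebra gives zero.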


Definition (3.1.1) can be derived axiomatically, either
presupposing a probability on $E$, or defining the probability in terms of $H$.


We can define entropy axiomatically in terms of a probability. 
Such axioms were first presented by Shannon [39]. Various versions and 
simplifications have been summarized by Aczel [1]. Since the entropy is a 
function of the probability distribution, the axioms are often expressed in 
terms of n-tuples of positive numbers summing to unity. We will here 


rephrase them in terms of Boolean algebras. 


Let the entropy $H$ be a function from the category of all finite
probability algebras to the real numbers. For any finite algebra $\mathcal{A}$, and
a sub-algebra $\mathcal{B}$, we can define the entropy of $\mathcal{A}$ conditional on $\mathcal{B}$
according to (2.4.4).


It is reasonable to want

$H(\mathcal{A}) = H(\mathcal{B}) + H(\mathcal{A}|\mathcal{B})$ for each $\mathcal{B} \subset \mathcal{A}$ .    (3.1.2)

It turns out that formula (3.1.2), along with the requirement that
the entropy of a diatomic algebra be a Lebesgue measurable function of the
probability, is sufficient to characterize (3.1.1) up to a scale constant
[30].
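Requirement (3.1.2) can be verified directly for the formula (3.1.1). In this sketch the algebra $\mathcal{A}$ has four atoms with assumed probabilities, and the sub-algebra $\mathcal{B}$ is obtained by merging them in pairs; the conditional entropy is computed as the expectation (2.4.4).

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

atoms = [0.5, 0.25, 0.125, 0.125]          # atoms of A (assumed probabilities)
blocks = [[0, 1], [2, 3]]                   # B merges atoms 0,1 and atoms 2,3

p_blocks = [sum(atoms[i] for i in b) for b in blocks]
# H(A|B) = sum over atoms B of p(B) * H(BA), with BA carrying p(.)/p(B)
h_cond = sum(pb * entropy([atoms[i] / pb for i in b])
             for b, pb in zip(blocks, p_blocks))

print(entropy(atoms), entropy(p_blocks) + h_cond)
```

The two printed numbers agree up to rounding, whatever grouping of the atoms is chosen for `blocks`.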
In fact, it is sufficient to require that (3.1.2) be true only
for sub-algebras $\mathcal{B}$, one of whose atoms is equal to the join of two atoms of
$\mathcal{A}$, the other atoms being identical to those of $\mathcal{A}$.


3.2 We can also characterize (3.1.1) using a limited definition of
probabilities [23].


Let F be the category of all finite Boolean algebras. F is a 


hereditary lattice of algebras according to section 2.3. 


For any $\mathcal{A} \in F$ let $S(\mathcal{A})$ represent the set of all sub-rings of
$\mathcal{A}$. If $\varphi$ is any homomorphism from $\mathcal{A}$ into $\mathcal{B}$, it induces a map $\varphi_*$
from $S(\mathcal{A})$ into $S(\mathcal{B})$. A sub-class $\mathcal{H} \subset F$ is designated as a class
containing homogeneous algebras. Within this sub-class, isomorphic algebras
are identified.

For any $H \in \mathcal{H}$, we define a probability by

$p_H(A) = \dfrac{N(AH)}{N(H)}$ ,    (3.2.1)

where $N(\mathcal{A})$ represents the number of atoms of $\mathcal{A}$. Let $G = \bigcup_{H \in \mathcal{H}} S(H)$.
We will first define $H$ only on $G$.
We require the properties that
(a) If $H \in \mathcal{H}$ and $\varphi$ is an automorphism of $H$, then

$H(\varphi_*(\mathcal{B})) = H(\mathcal{B})$ for all $\mathcal{B} \in S(H)$ .    (3.2.2)
(b) $H$ is isotone on $\mathcal{H}$, i.e., if $G, H \in \mathcal{H}$ and $G \subset H$, then

$H(G) \le H(H)$ .    (3.2.3)

(c) If $H \in \mathcal{H}$ and $G$ is a sub-algebra of $H$ (not necessarily in
$\mathcal{H}$) then

$H(H) = H(G) + H(H|G)$ .    (3.2.4)


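$H$ can be used to define a metric. If $\mathcal{A}$ and $\mathcal{B}$ are isomorphic,
let

$\delta(\mathcal{A},\mathcal{B}) = \inf_\varphi \sup_{\mathcal{C} \in S(\mathcal{A})} |H(\mathcal{C}) - H(\varphi_*(\mathcal{C}))|$ ,    (3.2.5)

where $\varphi$ ranges over all isomorphisms from $\mathcal{A}$ to $\mathcal{B}$. Now, if $\mathcal{A}$, $\mathcal{B}$ are
not isomorphic, let $\rho(\mathcal{A},\mathcal{B}) = 1$, otherwise let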


(A,B) -3en (3.2.6) 


Then $\rho$ is a metric on $G$, and under this metric, $H$ is
continuous. If we form the completion of $G$ with respect to $\rho$, and
extend $H$ by continuity, it can then be shown that we can define a
probability $p_{\mathcal{A}}$ on each $\mathcal{A} \in F$, and that

$H(\mathcal{A}) = -\sum_{A \in \mathcal{A}} p_{\mathcal{A}}(A) \log p_{\mathcal{A}}(A)$ ,    (3.2.7)

the summation extending over all atoms $A$ of $\mathcal{A}$. Furthermore, these
probabilities are consistent, in the sense that if $\mathcal{B}$ is a sub-ring of $\mathcal{A}$,
then
$p_{\mathcal{B}} = \dfrac{p_{\mathcal{A}}|\mathcal{B}}{p_{\mathcal{A}}(I_{\mathcal{B}})}$ ,    (3.2.8)

where $I_{\mathcal{B}}$ is the maximal element of $\mathcal{B}$ and by $p_{\mathcal{A}}|\mathcal{B}$ we mean $p_{\mathcal{A}}$
restricted to $\mathcal{B}$.


3.3 We can also talk about average information in the generalized sense
of (1.4.1).


We can consider two finite sub-algebras $\mathcal{A}$ and $\mathcal{B}$, the first
consisting of events of interest, but not directly observable, the second
consisting of observable events. We want to express the average
information in $\mathcal{A}$, transmitted via $\mathcal{B}$.


If we observe $B \in \mathcal{B}$, the gain in information is:

$J_B(A) = \log \dfrac{p(A|B)}{p(A)}$ ,    (3.3.1)

so that the average information in favour of $A$ when $A$ obtains is:

$J_{\mathcal{B}}(A) = \sum_B p(B|A) \log \dfrac{p(A|B)}{p(A)}$ ,    (3.3.2)

summing over all atoms $B$ of $\mathcal{B}$. The average information in $\mathcal{A}$ via $\mathcal{B}$
is hence:

$R(\mathcal{A},\mathcal{B}) = \sum_A p(A) \sum_B p(B|A) \log \dfrac{p(A|B)}{p(A)}$ ,    (3.3.3)

summing over all atoms $A$ of $\mathcal{A}$ and $B$ of $\mathcal{B}$.
A typical application is communication theory, where $\mathcal{A}$
represents transmitted symbols, and $\mathcal{B}$ represents received symbols. We know
the probability distribution $\mu$ of the transmitted symbols, and we know
the noise characteristics $\nu$ of the channel. $\nu$ is expressed as
conditional probabilities $\nu(B|A)$ of reception given transmission. We then
have $p(A) = \mu(A)$,

$p(B) = \sum_A \nu(B|A) \, \mu(A)$ ,

summing over all atoms $A$ of $\mathcal{A}$, and $p(A|B)$ can be determined from
Bayes' rule. Expression (3.3.3) can then be rearranged as:

$R(\mathcal{A},\mathcal{B}) = -\sum_A \mu(A) \log \mu(A) + \sum_B p(B) \sum_A p(A|B) \log p(A|B)$
$= H(\mathcal{A}) - H(\mathcal{A}|\mathcal{B})$ ,    (3.3.4)

the summations extending over all atoms $A$ of $\mathcal{A}$ and $B$ of $\mathcal{B}$, and
we see that it is the difference of two entropies. $H(\mathcal{A}|\mathcal{B})$ is called the
"equivocation of the channel", and $R(\mathcal{A},\mathcal{B})$ is the rate of transmission.
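The rearrangement leading to (3.3.4) can be followed on a small channel. The sketch below assumes a binary symmetric channel with equiprobable inputs and crossover probability 0.1; these choices, and the variable names, are illustrative only.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

mu = [0.5, 0.5]                        # assumed input distribution mu(A)
eps = 0.1                              # assumed crossover probability
nu = [[1 - eps, eps], [eps, 1 - eps]]  # nu[a][b] = nu(B|A)

p_b = [sum(mu[a] * nu[a][b] for a in range(2)) for b in range(2)]
# Bayes' rule: p(A|B) = nu(B|A) mu(A) / p(B)
post = [[nu[a][b] * mu[a] / p_b[b] for a in range(2)] for b in range(2)]

equivocation = sum(p_b[b] * entropy(post[b]) for b in range(2))  # H(A|B)
rate = entropy(mu) - equivocation                                 # (3.3.4)
# The same quantity computed directly from the double sum (3.3.3)
direct = sum(mu[a] * nu[a][b] * math.log(post[b][a] / mu[a])
             for a in range(2) for b in range(2))
print(rate, direct)
```

Both routes give the same rate, which lies strictly between zero and the entropy of the input.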


$R$ has also been used as a measure of the information provided by
an experiment in a Bayesian framework [31].


3.4 Similarly we can talk about the average discrimination information,
as

$I(p,q;\mathcal{A}) = \sum_A p(A) \log \dfrac{p(A)}{q(A)}$ ,    (3.4.1)

the summation extending over all atoms $A$ of $\mathcal{A}$. $I$ can be thought of
as the average information in $\mathcal{A}$ in favour of $p$ against $q$, when the
actual probability is $p$.

It is easy to show that "rate of transmission" (3.3.3), (3.3.4)
can be expressed as a discrimination. In fact

$R(\mathcal{A},\mathcal{B}) = I(p,p^*;\mathcal{A} \vee \mathcal{B})$ ,    (3.4.2)

where $p^*$ is the probability defined by taking $\mathcal{A}$ and $\mathcal{B}$ to be
stochastically independent.
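The identification of the rate of transmission with a discrimination can be illustrated numerically. In the following sketch the joint distribution on the atoms of the join is an assumed example, and the function name `discrimination` is ours; the product of the two marginals plays the role of the independent probability.

```python
import math

def discrimination(p, q):
    """I(p,q) = sum p log(p/q) over atoms; infinite if q vanishes where p does not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

# Assumed joint distribution on the atoms A_i ^ B_j of A v B (rows: A, columns: B).
joint = [[0.4, 0.1], [0.1, 0.4]]
p_a = [sum(row) for row in joint]
p_b = [sum(joint[i][j] for i in range(2)) for j in range(2)]

flat_p = [joint[i][j] for i in range(2) for j in range(2)]
flat_star = [p_a[i] * p_b[j] for i in range(2) for j in range(2)]  # independent version
print(discrimination(flat_p, flat_star))
```

For this example the value equals the entropy of the first algebra minus its entropy conditional on the second, as the identity requires.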


It may be possible that events possible under $q$ are impossible
under $p$, so that $q$ may be defective considered as a probability on $\mathcal{A}$.
If events are impossible under $q$ but possible under $p$, then total
knowledge can be gained, and we say that $I[p,q] = \infty$.


When the probabilities p and q are understood, we will write 
simply I(A) . When the algebra is understood, and we wish to indicate the 


probabilities, we will use square brackets I[p,q] 


3.5 So far we have confined ourselves to information-theoretic concepts
on finite algebras. Let us now attempt to extend them to the infinite case.
We will restrict ourselves to algebras $\mathcal{A}$ which are separable.

Let us first consider the concept of conditional entropy $H(\mathcal{A}|\mathcal{B})$,
where for now we require that $\mathcal{A}$ and $\mathcal{B}$ are finite sub-algebras of a
common probability algebra. This concept includes unconditional entropy since
$H(\mathcal{A}) = H(\mathcal{A}|T)$, $T$ being the trivial sub-algebra.
We have already noticed that $H$ is increasing in its first
argument. We will now show that it is decreasing in its second.

First, observe that

$g(t) = t \log t$    (3.5.1)

is a convex function and hence:

$\sum_B p(B) \, g(p(A|B)) \ge g\left(\sum_B p(B) \, p(A|B)\right)$

$= g[p(A)]$ ,

the summations extending over all atoms $B$ of $\mathcal{B}$. Therefore,

$H(\mathcal{A}|\mathcal{B}) = -\sum_A \sum_B p(B) \, g(p(A|B))$
$\le -\sum_A g(p(A))$
$= H(\mathcal{A})$ ,

the summations extending over all atoms $A$ of $\mathcal{A}$ and $B$ of $\mathcal{B}$.


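Further, we note that if $\mathcal{C} \subset \mathcal{B}$ then

$H(\mathcal{A}|\mathcal{C}) = \sum_C p(C) \, H(C\mathcal{A})$    (3.5.2)

and

$H(\mathcal{A}|\mathcal{B}) = \sum_B p(B) \, H(B\mathcal{A})$
$= \sum_C p(C) \, H(C\mathcal{A}|C\mathcal{B})$ ,    (3.5.3)

the summations extending over all atoms $B$ of $\mathcal{B}$ and $C$ of $\mathcal{C}$. Hence,

$H(\mathcal{A}|\mathcal{C}) \ge H(\mathcal{A}|\mathcal{B})$ .    (3.5.4)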
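The monotonicity in the second argument can be observed on a small example. The sketch below assumes a joint distribution of two atoms of the first algebra against three cells, and compares conditioning on the trivial partition, on a coarse partition, and on its refinement; all numbers and names are illustrative.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def cond_entropy(joint, blocks):
    """H(A|B) for joint[a][w] over atoms a of A and cells w, where the
    partition B groups the cells w into the given blocks."""
    total = 0.0
    for block in blocks:
        col = [sum(row[w] for w in block) for row in joint]  # p(A ^ B)
        pb = sum(col)
        total += pb * entropy([x / pb for x in col])
    return total

# Assumed joint probabilities; two atoms of A (rows), three cells w (columns).
joint = [[0.30, 0.10, 0.10], [0.05, 0.30, 0.15]]
h_t = cond_entropy(joint, [[0, 1, 2]])       # T: no conditioning
h_c = cond_entropy(joint, [[0, 1], [2]])     # C, a coarse partition
h_f = cond_entropy(joint, [[0], [1], [2]])   # B, a refinement of C
print(h_t, h_c, h_f)
```

The printed values decrease from left to right: the finer the conditioning partition, the smaller the conditional entropy.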


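We can thus define, for any algebra $\mathcal{B}$,

$H(\mathcal{A}|\mathcal{B}) = \inf_{\mathcal{C} \subset \mathcal{B}} H(\mathcal{A}|\mathcal{C})$ , where $\mathcal{C}$ is finite.    (3.5.5)

Also, we can define for any algebra $\mathcal{A}$

$H(\mathcal{A}|\mathcal{B}) = \sup_{\mathcal{C} \subset \mathcal{A}} H(\mathcal{C}|\mathcal{B})$ .    (3.5.6)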


3.6 We now turn our attention to discrimination information. It can
readily be proved, again from the convexity of $t \log t$, that $I(\mathcal{A}|\mathcal{B})$
is increasing in its first argument.

Hence we can define, for any algebra $\mathcal{A}$,

$I(\mathcal{A}|\mathcal{B}) = \sup_{\mathcal{C} \subset \mathcal{A}} I(\mathcal{C}|\mathcal{B})$ .    (3.6.1)
C<A 


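However, $I$ is not monotone in its second argument. It is
true if $\mathcal{C} \subset \mathcal{B} \subset \mathcal{A}$ that $I(\mathcal{A}|\mathcal{B}) \le I(\mathcal{A}|\mathcal{C})$, but if $\mathcal{A}$ and $\mathcal{B}$ are
incomparable it may be false. This is not surprising if one recalls that

$I(\mathcal{A}|\mathcal{B}) = I(\mathcal{A} \vee \mathcal{B}|\mathcal{B})$ .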
A relatively simple counter-example can be found:

Let the atoms of $\mathcal{A}$ and $\mathcal{B}$ be respectively $A_1, A_2$ and
$B_1, B_2$. Let $p$ and $q$ be defined by the following
tables, on $\mathcal{A} \vee \mathcal{B}$.


P q 
Ay eq 
Bo Bo 
a a 
ap, BS 


Then we find that $I(\mathcal{A}) = I(\mathcal{A}|T) = 0$ but

$I(\mathcal{A}|\mathcal{B}) = \tfrac{5}{4} \log 2 - \tfrac{3}{4} \log 3 = .0425$ .    (3.6.2)


It is quite in keeping with intuition that ancillary information from a
previous experiment could improve the informativeness of a present
experiment.


However, in spite of lack of monotonicity, if $\mathcal{B}_n$ is an
increasing sequence of finite sub-algebras such that $\bigvee_n \mathcal{B}_n = \mathcal{B}$, then the
sequence of random variables $I_{\mathcal{B}_n}(\mathcal{A}|\cdot)$ does converge, as does the
sequence of their expectations, hence we can unambiguously talk about
$I(\mathcal{A}|\mathcal{B})$ and $I_{\mathcal{B}}(\mathcal{A}|\cdot)$ whenever $\mathcal{B}$ is separable (see [18], proof of theorem
2.3).

The limit can in fact be represented as the discrimination
information for the conditional probabilities of section 2.5.


4. Properties of Entropy and Discrimination Information. 


We will here list some of the more significant aspects of these 


two quantities. 


4.1 (a) Entropy is non-negative and isotone. It is zero only for the
trivial algebra.

$\mathcal{A} \subset \mathcal{B} \Rightarrow H(\mathcal{A}) \le H(\mathcal{B})$    (4.1.1)

$H(\mathcal{A}) \ge 0$ , $H(\mathcal{A}) = 0$ iff $\mathcal{A} = T$ .    (4.1.2)


(b) Entropy is conditionally additive

$H(\mathcal{A} \vee \mathcal{B}) = H(\mathcal{B}) + H(\mathcal{A}|\mathcal{B})$ .    (4.1.3)

This can be derived for the finite case from (3.1.2),
substituting $\mathcal{A} \vee \mathcal{B}$ for $\mathcal{A}$, and using (2.4.5). The property follows
in the separable case by taking monotone limits. It follows from
(2.6.2) that:
(c) Entropy is additive for independent subfields. $\mathcal{A}$ and $\mathcal{B}$ are
independent

$\Rightarrow H(\mathcal{A} \vee \mathcal{B}) = H(\mathcal{A}) + H(\mathcal{B})$ .    (4.1.4)

(d) Conditional entropy is conditionally additive.

$H(\mathcal{A} \vee \mathcal{B}|\mathcal{C}) = H(\mathcal{B}|\mathcal{C}) + H(\mathcal{A}|\mathcal{B} \vee \mathcal{C})$ .    (4.1.5)

For atomic algebras the result follows from (4.1.3) for each
atom of $\mathcal{C}$. It follows in general by monotone limits.


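(e) Both conditioned and conditional entropy are continuous in both
arguments. If

$\mathcal{B}_n \uparrow \mathcal{B}$ , $\mathcal{C}_n \downarrow \mathcal{C}$

then

$H_{\mathcal{B}_n}(\mathcal{A}|\cdot) \to H_{\mathcal{B}}(\mathcal{A}|\cdot)$ , $H_{\mathcal{C}_n}(\mathcal{A}|\cdot) \to H_{\mathcal{C}}(\mathcal{A}|\cdot)$ ,
$H(\mathcal{A}|\mathcal{B}_n) \to H(\mathcal{A}|\mathcal{B})$ , $H(\mathcal{A}|\mathcal{C}_n) \to H(\mathcal{A}|\mathcal{C})$ ,
$H(\mathcal{B}_n|\mathcal{A}) \to H(\mathcal{B}|\mathcal{A})$ , $H(\mathcal{C}_n|\mathcal{A}) \to H(\mathcal{C}|\mathcal{A})$ .    (4.1.6)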


For proofs see [34] where our conditional entropy is called 


conditional information. 


(f) Conditional entropy is anti-isotone in the conditioning algebra.
4.2 (a) Discrimination information is non-negative and isotone. It is
equal to zero if and only if the two measure algebras are
identical:

$I(p,q;\mathcal{A}) \ge 0$ , $I(p,q;\mathcal{A}) = 0 \iff p(A) = q(A)$ for all $A \in \mathcal{A}$ .    (4.2.1)

This follows from the convexity of $t \log t$, see [29].
(b) Conditional discrimination information is conditionally additive:

$I(\mathcal{A} \vee \mathcal{B}) = I(\mathcal{B}) + I(\mathcal{A}|\mathcal{B})$ .    (4.2.2)


This has been proved (for separable $\mathcal{B}$) by Ghurye [18]. If $\mathcal{A}$
and $\mathcal{B}$ are independent then

$I(\mathcal{A} \vee \mathcal{B}) = I(\mathcal{A}) + I(\mathcal{B})$ .    (4.2.3)


(c) Conditional discrimination information is discrimination
information, i.e., there exists $q^*$ such that

$I(p,q;\mathcal{A}|\mathcal{B}) = I(p,q^*;\mathcal{A})$ ,    (4.2.4)

see [7], equation 3.3.


(d) If $f$ is the Radon-Nikodym derivative of $p$ with respect to
This expression is usually taken as a definition. This result
has been demonstrated by Kolmogorov, Gelfand and Yaglom [27] and 
Ghurye [18]. 


(e) Conditional information characterizes sufficiency, i.e., for $\mathcal{B} \subset \mathcal{A}$,
$I(\mathcal{A}|\mathcal{B}) = 0$ iff $\mathcal{B}$ is a sufficient sub-field for the pair $\{p,q\}$.


(f) Discrimination information is continuous.

If $\mathcal{A}_n \to \mathcal{A}$ then $I(\mathcal{A}_n) \to I(\mathcal{A})$ .    (4.2.6)

Proof follows from the convexity of $t \log t$ and appendix 1.


4.3 Discrimination information, in measuring the ease in differentiating
one probability distribution from another, in a sense measures a "distance"
between probability measures. It is, however, not a metric on the space of
probability measures, for it is neither symmetric, nor does it satisfy the
triangle inequality.
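Both failures are already visible on algebras with two atoms. The three distributions in this sketch are assumed examples picked so that the effect shows clearly; the function name `discrimination` is ours.

```python
import math

def discrimination(p, q):
    """I(p,q) = sum p log(p/q) over the atoms (all q assumed positive here)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p, r, q = [0.5, 0.5], [0.75, 0.25], [0.9, 0.1]   # assumed distributions
print(discrimination(p, q), discrimination(q, p))          # not symmetric
print(discrimination(p, q),
      discrimination(p, r) + discrimination(r, q))         # triangle fails
```

In the first line the two values differ, and in the second the direct discrimination exceeds the sum taken through the intermediate distribution.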


However, one can define convergence of probability in terms of
discrimination information, and this convergence is stronger than
convergence in the total variation norm [8]. In general, however, the
convergence structure is only a Frechet-V-space [16,17], and not even a
topological space.


In classical statistical problems, our set of possible
probabilities is parametrized by a convex set in Euclidean space, and thus has not
only a metric, but also a differentiable structure. If we let $\theta_0$
represent the parameter for the null hypothesis, then $I[\theta_0,\theta]$ is a measure of
how easy it is to reject the null hypothesis under the alternative $\theta$.
The local behaviour of $I$ about $\theta_0$ suggests how "secure" our
statistical inferences regarding $\theta_0$ are. Since $I[\theta_0,\theta]$ is increasing away from
$\theta_0$, it is easy to see that the total differential, if it exists, must be
zero. Hence the curvature would give an idea as to how quickly $I$ moves
away from its tangent plane, i.e., how quickly the discrimination
information increases. Let $D(\theta_0)$ be the matrix of second derivatives of
$I[\theta_0,\theta]$ at $\theta_0$. Then, provided that we assume sufficient regularity
conditions, it is easily seen that $D(\theta_0)$ is Fisher's information matrix
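The relation between the curvature of the discrimination information and Fisher's information can be checked in the simplest one-parameter case. The sketch below assumes a Bernoulli family with parameter 0.3 and approximates the derivatives by finite differences; the family, the step size and the names are our choices, and the closed form $1/(\theta_0(1-\theta_0))$ is the standard Fisher information of a Bernoulli trial.

```python
import math

def discrimination(theta0: float, theta: float) -> float:
    """Discrimination information between Bernoulli(theta0) and Bernoulli(theta)."""
    return (theta0 * math.log(theta0 / theta)
            + (1 - theta0) * math.log((1 - theta0) / (1 - theta)))

theta0, h = 0.3, 1e-4
# Central finite differences in theta, evaluated at theta = theta0
first = (discrimination(theta0, theta0 + h)
         - discrimination(theta0, theta0 - h)) / (2 * h)
second = (discrimination(theta0, theta0 + h)
          - 2 * discrimination(theta0, theta0)
          + discrimination(theta0, theta0 - h)) / h ** 2
fisher = 1 / (theta0 * (1 - theta0))
print(first, second, fisher)
```

The first derivative is numerically zero, and the second derivative agrees with the Fisher information up to the accuracy of the finite-difference approximation.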
CHAPTER II


A BRIEF HISTORY 


1. Information and Communication. 


A concept central to communication theory is a communication channel, consisting of a set of inputs $X$, a set of outputs $Y$, and a link between them, $\nu$. For the moment we will leave these three components further unspecified.


1.1 Our first problem is to define what we mean by the amount of information which can be transmitted, that is, the information which can be conveyed by $X$. If $X$ contains $n$ symbols, then we can construct $n^k$ sequences (or "super-symbols") from $k$ copies of $X$, and as we would intuitively expect that $k$ copies of $X$ contain $k$ times as much information, it seems reasonable to use $c \log n$ as a measure of the information potential of $X$.


However, let us suppose we have two channels, and in both cases $X$ contains 2 symbols, 0 and 1. A message is sent along each channel once a second. However, in the first channel, 0 and 1 are sent with about equal regularity, but in the second one, most messages are 1s, and a 0 occurs on the average only once a year. We would intuitively think that the second channel would be transmitting far less information.


It hence appears that the probabilities of the symbols being transmitted affect the amount of information, so that $X$ would be completely described as a probability space $(X,\mathcal{X},p)$.


Claude Shannon in 1948 [39] gave axioms which an information measure should satisfy, and derived from them the formula for entropy (I, 3.1.1). Shannon's axioms were stronger than the ones we gave in (I, 3.1). He assumed that $H$, as a function of the individual probabilities for fixed $n$ (the number of symbols), was continuous; that it was increasing in $n$ if the probabilities were uniform; and essentially (I, 3.1.2). It is customary to use logarithms to base 2.


Entropy is maximum when the probabilities are uniform, and any "averaging process" (essentially a convolution) tends to increase it. These properties are what one would expect if entropy is thought of as a measure of disorder. The concept had been used in that context earlier in statistical mechanics [40].
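The two binary channels described earlier in this section can be compared with a few lines of arithmetic. The sketch below assumes one symbol per second and takes "once a year" to mean a probability of one over the number of seconds in a 365-day year; the function name is ours.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits; terms with zero probability contribute 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

seconds_per_year = 365 * 24 * 60 * 60
p_zero = 1 / seconds_per_year          # a 0 about once a year, one symbol per second

balanced = entropy_bits([0.5, 0.5])            # first channel
lopsided = entropy_bits([p_zero, 1 - p_zero])  # second channel
```

The first channel carries one bit per symbol, the second less than a millionth of a bit, which agrees with the intuition that it transmits far less information.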


If we consider the product set $X^n$ with probability $p^n$, as the set of $n$-symbol sequences sent independently, then we have:
$2^{-nH}$ is the approximate probability, for large $n$, of each "reasonably probable" sequence of $n$ elements.
1.2 If there is a one-to-one correspondence between the input symbols and output symbols, then we speak of a noiseless channel. In this case, no information is lost in transmission. The more usual case is where the link between the channels is stochastic. For each $x \in X$, there is a probability distribution $\nu(x,\cdot)$ on $Y$. The situation is now identical with (I, 3.3), and we can define $R(X,Y)$ as the rate of transmission. Note that the noise characteristic $\nu$ is assumed known.


The supremum of $R(X,Y)$ over all probabilities $p$ on $X$ is called the capacity of the channel. Coding theory is concerned with choosing symbols so as to maximize capacity.
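As an illustration of the supremum defining capacity, consider a binary channel that flips each symbol with a fixed probability. This channel, the flip probability 0.1, the grid search, and the function names are all assumptions made for the example; the rate is computed in the standard form $H(Y) - H(Y|X)$ in bits.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rate(p_one, flip):
    """R(X,Y) in bits when P(X=1)=p_one and each symbol is flipped w.p. flip."""
    p_y_one = p_one * (1 - flip) + (1 - p_one) * flip
    return h2(p_y_one) - h2(flip)      # H(Y) - H(Y|X)

flip = 0.1
grid = [k / 1000 for k in range(1001)]             # candidate input probabilities
best_input = max(grid, key=lambda p: rate(p, flip))
capacity = rate(best_input, flip)
```

On this grid the maximum is attained by the uniform input, and the resulting capacity equals one bit minus the binary entropy of the flip probability.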


1.3 Many real-life communication channels transmit signals which are continuously varying voltages, and hence are not expressible as one of a finite number of symbols. It would be useful to have an information measure for such channels as well. Unfortunately, the obvious quantity (3.5.6) is always infinite in this case, and hence is not very useful.
Shannon and Wiener [40] independently proposed the measure
$-E(\log f)$    (1.3.1)


(though with opposite signs), $f$ being the usual density of the probability distribution.


Discrimination information has been generalized by Ghurye [18] where $p$ is any finite measure, absolutely continuous with respect to $q$, which is $\sigma$-finite, and (I, 4.2.5) holds. Hence, the entropy of a continuous distribution $p$ can be written, up to sign, as $I[p,q]$, $q$ being Lebesgue measure, in a sense the "most random" distribution.


Continuous entropy may be infinite, so it has no "maximum". However, for all distributions concentrated on a bounded set, entropy is maximum if the distribution is uniform; for all distributions concentrated on the positive real axis with expectation $\lambda$, it is maximum if the distribution is exponential; and for all distributions with variance $\sigma^2$, it is maximum if the distribution is normal. Entropy is location invariant but it depends on the scale. It may be negative. Like discrete entropy, continuous entropy is increased by an "averaging" process.
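Some of these statements can be illustrated with the standard closed-form entropies (natural logarithms) of three densities having the same variance. The choice of the uniform and Laplace distributions as competitors to the normal is ours, as are the function names.

```python
import math

def normal_entropy(sigma):
    return 0.5 * math.log(2 * math.pi * math.e * sigma**2)

def uniform_entropy(sigma):
    # a uniform density of width sqrt(12)*sigma has variance sigma^2
    return math.log(math.sqrt(12) * sigma)

def laplace_entropy(sigma):
    # a Laplace density with scale b = sigma/sqrt(2) has variance sigma^2
    return 1 + math.log(2 * sigma / math.sqrt(2))

sigma = 1.0
entropies = {
    "normal": normal_entropy(sigma),
    "uniform": uniform_entropy(sigma),
    "laplace": laplace_entropy(sigma),
}
```

Among the three, the normal density has the largest entropy. Evaluating `normal_entropy(0.1)` gives a negative value, and doubling `sigma` adds `log 2`, which shows the dependence on scale.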


1.4 The rate of transmission can be defined in the continuous case even when the entropy is infinite, as a discrimination information, cf. (I, 3.4.2). Let $f$ be the density of the input signal, let
$p(x,y) = f(x)\,\nu_x(y)$


($\nu$ being the noise characteristic, see section 1.2),
$u(y) = \int f(x)\,\nu_x(y)\,dx$


and
$q(x,y) = f(x)\,u(y)$;


then


$R(X,Y) = I[p,q]$    (1.4.1)
2. Information and Inference.


2.1 The concept of information in the context of statistics was introduced by R.A. Fisher [13,14]. He defined a "sufficient statistic" as one which contains all the information in an experiment. This intuitive idea is tersely expressed by (I, 4.2e).


The concept was also used in the theory of estimation, as corresponding to reciprocal variance (of an estimator). The more dispersed our estimator is, the less efficient or informative it is.


If $\hat\theta$ is a maximum likelihood estimator, it is the solution of


$\frac{\partial L(\theta)}{\partial\theta} = 0$
where
$L = \sum_{i=1}^{n} \log f_\theta(x_i)$    (2.1.1)


$f_\theta$ being the density of each sample value $x_i$. If $\hat\theta$ is unbiased, and normally distributed, then we have


$-\frac{\partial^2 L}{\partial\theta^2}\Big|_{\theta=\hat\theta} = \frac{1}{\sigma_{\hat\theta}^2}$    (2.1.2)
so that we can use $-\frac{\partial^2 L}{\partial\theta^2}$ as a measure of the informativeness of $\hat\theta$.
We can also talk about expected information:
$I(\theta) = E_\theta\left[-\frac{\partial^2 L}{\partial\theta^2}\right] = n\,E_\theta\left[-\frac{\partial^2 \log f_\theta}{\partial\theta^2}\right] = n\,i(\theta)$    (2.1.3)
Fisher calls the quantity $i(\theta)$ the intrinsic accuracy of the density $f_\theta$, and says it represents the maximum amount of information provided by a single observation [14, p. 709].


Under appropriate regularity conditions, we have the well-known Rao-Cramer inequality [35; 5, p. 477ff] which gives for an unbiased estimator:


$\mathrm{var}(\hat\theta) \ge \frac{1}{n\,i(\theta)}$    (2.1.4)


Equality holds if and only if the estimator is sufficient and normally distributed, so that Fisher's idea of maximum information is well founded. (2.1.4) can be extended to the multi-parameter case [35]. Letting $I$ be the matrix


$-E_\theta\left(\frac{\partial^2 L}{\partial\theta_i\,\partial\theta_j}\right)\Big|_{\theta=\hat\theta}$    (2.1.5)
then we have that
$D \ge I^{-1}$    (2.1.6)


$D$ representing the dispersion matrix. We have already seen the relation between the matrix $I$ and discrimination information (I, 4.3).


2.2 If $\hat\theta$ is a sufficient estimator, then we can write $f_\theta(x) = g_\theta(t)\,h(x|t)$, where $g$ is the density of $\hat\theta$. Then,


$\frac{\partial^2 \log f_\theta(x)}{\partial\theta^2} = \frac{\partial^2 \log g_\theta(t)}{\partial\theta^2}$
Bearing this in mind, we can define, for any random vector $T$ such that its one-parameter density is twice-differentiable, the information quantity:


$I_\theta(T) = -E_\theta\left(\frac{\partial^2 \log g_\theta(T)}{\partial\theta^2}\right)$    (2.2.2)


This quantity has certain attractive properties [34, p. 5]:


(a) $I_\theta(T) \ge 0$, equality holding iff $T$ has the same distribution for all $\theta$.


(b) $I_\theta(T) \le I_\theta(X)$ if $T$ is a statistic from $X$, equality holding only in the case of sufficiency.


(c) $I_\theta(T_1,T_2) = I_\theta(T_1) + I_\theta(T_2)$ if $T_1$ and $T_2$ are independent.


2.3 Fisher drew a parallel between his information and entropy [15, p. 47]. Thermodynamic processes which are irreversible are accompanied by an increase in entropy. Statistical processes which are irreversible (i.e., data processing) cannot increase information.


2.4 A complete theory of inference based on discrimination information has been developed by Kullback [29]. In it, the discrimination information is used as a pseudo-distance. Central to his development is an exponential family of distributions "closest" to a "null" distribution among all distributions yielding the same expected value for a statistic.
2.5 The calculation of a statistic can be considered as a communications channel as in section 1 which is ambiguous but otherwise noiseless, i.e., the measures $\nu(x,\cdot)$ are degenerate. Csiszar [7] introduced the idea of a noisy channel in statistics, which may be realized in the case where an observation has an error in addition to the intrinsic error of the experimental material. He calls it the case of indirect observation.


Not surprisingly, an indirect observation can only decrease the discrimination information. If the decrease is zero, the channel can be called sufficient. If the decrease is less than $\varepsilon$, the channel can be called $\varepsilon$-sufficient. (However, if the information in both the direct and indirect observations is infinite, the "decrease" is undefined.) The same applies a fortiori to statistics and subfields. If a sufficient channel does not exist, or is expensive to construct, an $\varepsilon$-sufficient one may be adequate.
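The loss of discrimination information under indirect observation can be seen in a small finite example. The distributions and the noise characteristic below are hypothetical numbers chosen for illustration, and the function names are ours.

```python
import math

def discrimination(p, q):
    """Discrimination information in favour of p against q (finite case)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def observe_indirectly(dist, channel):
    """Push a distribution on inputs through a channel given as rows nu(x, .)."""
    outputs = len(channel[0])
    return [sum(dist[x] * channel[x][y] for x in range(len(dist)))
            for y in range(outputs)]

p = [0.7, 0.2, 0.1]
q = [0.2, 0.3, 0.5]
noisy = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]      # hypothetical noise characteristic
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]      # noiseless, one-to-one channel

direct = discrimination(p, q)
indirect = discrimination(observe_indirectly(p, noisy), observe_indirectly(q, noisy))
lossless = discrimination(observe_indirectly(p, identity), observe_indirectly(q, identity))
```

Here `indirect` is smaller than `direct`, while the one-to-one channel loses nothing; the difference `direct - indirect` is the decrease that would be compared with $\varepsilon$.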


3. Various Tangents. 


3.1 Renyi [36] has considered generalizations of entropy and discrimination information by relaxation of the axioms.


By replacing the conditional additivity (I, 3.1.2) by independent additivity (I, 4.1.4) he deduces that the quantities
also satisfy the axioms. It is easily shown that


$\lim_{\alpha \to 1} H_\alpha(\mathcal{A}) = H(\mathcal{A})$.    (3.1.2)


Renyi's treatment includes defective probabilities as well. In this case 


the quantity in (3.1.1) is divided by Phone. 


Similarly, Renyi generalized discrimination information to


$I_\alpha(\mathcal{A}) = \frac{1}{\alpha-1} \log \sum_{A \in \mathcal{A}} q(A) \left(\frac{p(A)}{q(A)}\right)^{\alpha}$    (3.1.3)
the summation extending over all atoms $A$ of $\mathcal{A}$. This quantity again satisfies the independent additivity property (I, 4.2.3).
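The limiting behaviour as the order tends to 1 can be checked numerically. The sketch below assumes the standard form of Renyi's entropy of order $\alpha$, namely $\frac{1}{1-\alpha}\log\sum p^\alpha$, and uses the form (3.1.3) for the generalized discrimination information; the distributions are arbitrary illustrative values.

```python
import math

def shannon_entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def renyi_entropy(p, alpha):
    """Order-alpha entropy, (1/(1-alpha)) log sum p^alpha, for alpha != 1."""
    return math.log(sum(pi**alpha for pi in p if pi > 0)) / (1 - alpha)

def discrimination(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def renyi_discrimination(p, q, alpha):
    """Order-alpha discrimination information in the form of (3.1.3)."""
    return math.log(sum(qi * (pi / qi)**alpha for pi, qi in zip(p, q))) / (alpha - 1)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
near_one = 1 + 1e-6
entropy_gap = abs(renyi_entropy(p, near_one) - shannon_entropy(p))
discrimination_gap = abs(renyi_discrimination(p, q, near_one) - discrimination(p, q))
```

Both gaps are of the order of the distance between the chosen order and 1, as the limit relation suggests.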


3.2 Discrimination information was seen, in the general case, to be the supremum of discrimination information over all finite sub-algebras.


An analogous quantity was defined by Ghurye [18] for any finite measure $p$ dominated by a $\sigma$-finite measure $q$ defined on $\mathcal{A}$, and any convex function $f$ defined on $[0,\infty]$, as


$I_f(\mathcal{A}) = \sup \sum q(B)\, f\!\left(\frac{p(B)}{q(B)}\right)$    (3.2.1)


where the summation extends over all atoms $B$ of the atomic subalgebras $\mathcal{B}$ of $\mathcal{A}$.


If $dp/dq$ denotes the density of $p$ with respect to $q$, then we have the analogue of (I, 4.2.5)
We have seen that continuous entropy is an example of this generalization. We will need it again in chapter III.


3.3 Two of the most significant properties of discrimination information are that it is monotonically increasing ($\mathcal{B} \subset \mathcal{A} \Rightarrow I(\mathcal{B}) \le I(\mathcal{A})$), but is strictly increasing only for non-sufficient sub-algebras, i.e., $I(\mathcal{B}) = I(\mathcal{A})$ if $\mathcal{B}$ is a sufficient sub-algebra. It turns out that convexity of $t \log t$ is the only assumption needed to prove these two facts. Csiszar [7] has hence defined, for any convex function $f$, the $f$-divergence between the probability measures $p$ and $q$ by (3.2.2). To take care of the situation where $p$ is not dominated by $q$, he defined


$0\,f\!\left(\frac{0}{0}\right) = 0$ and $0\,f\!\left(\frac{a}{0}\right) = \lim_{\varepsilon \downarrow 0} \varepsilon f\!\left(\frac{a}{\varepsilon}\right) = a \lim_{t \to \infty} \frac{f(t)}{t}$    (3.3.1)


There may be situations where some $f$-divergences are more meaningful than discrimination information. It has already been mentioned that the neighbourhood systems defined by $I$ on the space of measures need not define a topological space. However, if both $f(0)$ and $\lim_{t \to \infty} f(t)/t$ are finite and $f$ is strictly convex at 1, then $I_f$ does define a topology, equivalent to the variation difference, the latter itself being an $f$-divergence with $f(t) = |t-1|$ [8].


If $p$ is not dominated by $q$, then discrimination information is infinite. However, if an event is impossible under $q$, yet has a positive but very small probability under $p$, we may not want to think of this "discrimination distance" as so large, as in fact the two distributions may be very difficult to tell apart by experiment. By appropriate choice of $f$, we can allow $I_f$ to remain finite in such a case.
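The situation just described can be made concrete on two atoms. The sketch below evaluates an $f$-divergence on a finite partition using the conventions $0 f(0/0) = 0$ and $0 f(a/0) = a \lim f(t)/t$; the limit is passed in by hand as the parameter `slope_at_infinity`, which is an assumption of this sketch rather than something computed. The probabilities are illustrative.

```python
import math

def f_divergence(p, q, f, slope_at_infinity):
    """Sum of q*f(p/q) over atoms, with 0*f(0/0)=0 and 0*f(a/0)=a*lim f(t)/t."""
    total = 0.0
    for pi, qi in zip(p, q):
        if qi > 0:
            total += qi * f(pi / qi)
        elif pi > 0:
            total += pi * slope_at_infinity
    return total

def t_log_t(t):
    return t * math.log(t) if t > 0 else 0.0

def variation(t):
    return abs(t - 1)

p = [0.999, 0.001]      # second event is rare under p
q = [1.0, 0.0]          # and impossible under q

discrimination = f_divergence(p, q, t_log_t, math.inf)      # lim (t log t)/t is infinite
variation_difference = f_divergence(p, q, variation, 1.0)   # lim |t-1|/t = 1
```

With $f(t) = t \log t$ the value is infinite, while with $f(t) = |t-1|$ it is 0.002, reflecting how hard the two distributions are to tell apart.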
CHAPTER III


INFORMATION IN MARKOV PROCESSES 


1. Basics of Markov Theory. 


1.1 Let us now suppose that we have a family of sub-algebras of a probability algebra $\mathcal{A}$, which has a temporal structure. Let $T$ be a totally ordered set (which is usually assumed to be either the non-negative integers, or the non-negative real line). For every $t \in T$, we will suppose we have a sub-algebra $\mathcal{B}_t$, called the algebra of events observed at epoch $t$. Also, for any closed interval $[s,t]$, we have an algebra of events $\mathcal{C}_{s,t}$ called the algebra of events observed between epochs $s$ and $t$. As we will be dealing solely with separable algebras, we will assume that each $\mathcal{B}_t$ and each $\mathcal{C}_{s,t}$ is separable. We will, in fact, assume a stronger separability condition:


If $D$ is a dense countable subset of $(s,t)$ then


$\mathcal{C}_{s,t} = \bigvee_{r \in D} \mathcal{B}_r \vee \mathcal{B}_s \vee \mathcal{B}_t$    (1.1.1)


This latter condition, of course, is trivial in the case that $T$ is countable.


If we have a continuous probability $p$ defined on $\mathcal{A}$, we will call the system $(\mathcal{A},T,\mathcal{B}_t,\mathcal{C}_{s,t},p)$ a separable stochastic process. We will say that it has the Markov property if for any $t$ and $s > t$, we have:


The algebras $\mathcal{C}_{0,t}$ and $\mathcal{C}_{t,s}$ are conditionally independent
We will say that it is temporally homogeneous if, for any $s$ and
te and “h “ts <-t) 
p(E|B.) = PCF |B) Bee ke Feb = Cin.) 


We will concern ourselves exclusively with temporally homogeneous processes having the Markov property.


The quantities in (1.1.2) are called transition probabilities, and our separability condition (1.1.1) ensures that any event in any $\mathcal{C}_{s,t}$ has a unique probability defined in terms of them, conditional on $\mathcal{B}_s$.


1.2 The preceding was a brief introduction to stochastic processes in the language of Chapter I. In order to make use of other results, without the need for lengthy reformulation, we will return to more standard notation. We will assume that the algebras $\mathcal{B}_t$ are all isomorphic, and hence can be represented by a family of random variables with values in the same space.


Let $(\Omega,\mathcal{A},p)$ be our basic probability space, and let $(E,\mathcal{E})$ be another measurable space which we call the state space. For each $t \in T$ let $X_t$ be a measurable function from $\Omega$ to $E$. Our algebras $\mathcal{B}_t$ then are simply $X_t^{-1}(\mathcal{E})$, and the values which $X_t$ takes on are our observables.


Two processes $X_t$ and $Y_t$ on the same state space are said to be equivalent if the finite-dimensional distributions are the same, i.e.
We will be concerned exclusively with finite-dimensional distributions and their limits, so that we will essentially identify equivalent processes. As every stochastic process is equivalent to a separable process [32, IV, T29], our separability condition is hence no restriction. (Although our definition of separability appears more restrictive than Meyer's, I conjecture they are equivalent if one assumes continuity in probability. See [9, II, Theorem 2.2].)


1.3 Transition probabilities are usually expressed in terms of Markovian kernels [12, 10, 33]


P(x, F) = p(x 
Given an initial distribution $P_0$, we can define the distribution at any time $t$ by
$p(X_t \in F) = \int P_0(dx)\,P_t(x,F)$    (1.3.2)


hence we can see that the Markov kernels act as operators on the linear space of measures on $E$, which leave the subset of probability measures invariant.


They also define operators on the dual space of functions on $E$, defined by


$(P_t^* f)(x) = \int P_t(x,dy)\,f(y)$


1.4 The operators $P_t^*$ form a semi-group of transition operators. That is, they satisfy the Chapman-Kolmogorov relation
$P_s^* P_t^* = P_{s+t}^*$    (1.4.1)


they take positive functions into positive functions, and they leave constant functions invariant. From (1.4.1) we can see that they commute with each other. They are of norm 1.


The operators can be embedded in a Banach algebra, in which analogues of classical algebraic and analytic procedures can be developed [21] including limits, differentiation and integration. There are two limits that concern us. We say that the operator sequence $\{T_n\}$ converges to $T$ uniformly or strongly as, respectively
$D$ being the domain of the operators. Uniform convergence implies strong convergence.


Semi-groups of transition operators such that $P_0^* = I$ and that $P_t^* \to I$ strongly as $t \downarrow 0$ are called the Feller semi-groups [33], and are in fact (strongly) continuous everywhere. We will deal exclusively with such semi-groups, and in fact we will suppose that the transition probabilities are conservative, which is to say they are non-defective probability measures.


If the limit


$A^* = \lim_{t \downarrow 0} \frac{P_t^* - I}{t}$    (1.4.4)


exists uniformly, then we say that the semi-group is (uniformly) differentiable, and from this fact follows the Kolmogorov backward equation:


$\frac{dP_t^*}{dt} = A^* P_t^*$    (1.4.5)


In fact, the semi-group $(P_t^*)$ can be represented as:


$P_t^* = e^{tA^*}$    (1.4.6)


$A^*$ is called the infinitesimal generator of the semi-group.
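For a finite state space the operators are matrices and these relations can be verified directly. The sketch below assumes a hypothetical two-state generator and computes the exponential by its power series, which is adequate only because the generator is bounded; all names are ours.

```python
def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def mat_exp(generator, t, terms=40):
    """exp(t*generator) by its power series; adequate for small bounded generators."""
    size = len(generator)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    term = [row[:] for row in result]
    for n in range(1, terms):
        # term becomes (t*generator)^n / n!
        term = mat_mul(term, [[t * g / n for g in row] for row in generator])
        result = [[r + s for r, s in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result

A = [[-2.0, 2.0],
     [1.0, -1.0]]             # hypothetical two-state generator; rows sum to zero

P_s, P_t, P_st = mat_exp(A, 0.3), mat_exp(A, 0.5), mat_exp(A, 0.8)
product = mat_mul(P_s, P_t)   # should reproduce P_{s+t}, cf. (1.4.1)
h = 1e-6
P_h = mat_exp(A, h)
difference_quotient = [[(P_h[i][j] - (i == j)) / h for j in range(2)] for i in range(2)]
```

The product of the two transition matrices agrees with the matrix at the summed lag, each row remains a probability distribution, and the difference quotient at a small lag recovers the generator, cf. (1.4.4).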


If the limit in (1.4.4) does not exist uniformly, but if it exists strongly in some subspace $C$ dense in $D$ (i.e., the limit being in the norm), we still say that $A^*$ is the infinitesimal generator. (1.4.5) holds still [although $\frac{dP_t^*}{dt}$ is now only a strong derivative] but (1.4.6) need not, since $A^*$ may be an unbounded operator, so that the exponential function is not defined. However, the semi-group can be approximated by semi-groups of the form (1.4.6) [12]. We also have the inverse of (1.4.5), viz.


$P_t^* - I = A^* \int_0^t P_s^*\,ds$    (1.4.8)


This relation is true on $D$ if (1.4.4) exists uniformly, and is still true on $C$ if only (1.4.7) holds [10, 21].


1.5 An important class of Markov processes is that of the "step" kind. That is, the process remains constant for a random length of time, and then jumps to another state, where it again remains for a random length of time. Only a finite number of jumps occur in any finite interval.


There are three limit expressions which figure in the study of processes of this kind.
Convergence in any of the three can be pointwise or uniform in $x$, and in (1.5.3) the convergence can be pointwise or uniform in $F$.


(1.5.1) expresses the fact that if $X_0 = x$ then for $t$ sufficiently small $X_t = x$ (almost surely). It is a necessary condition for the sample functions to be step functions.


Uniform convergence of (1.5.1) is equivalent to the boundedness of $A_0$, and implies uniform convergence of (1.5.2) and uniform convergence of (1.5.3) in $F$. Furthermore, $A_1$ is in fact a measure, so that $A = A_0 + A_1$ can be considered a signed measure, for each $x$, with $A(x,E) = 0$; and $A_1(x,F)$ is a measurable function for each $F$ [9, VI.2].


If in fact the convergence in (1.5.3) is also uniform in $x$, then it follows that (1.4.4) exists uniformly, and
$A^* f(x) = \int A(x,dy)\,f(y)$    (1.5.4)


(see appendix 2).


Even if (1.4.4) does not hold uniformly, the operator $A^*$ will be bounded provided that the function $A_0$ is.


2. Two Simple Examples. 


We will now consider discrimination information in a temporally homogeneous Markov process.
Let us suppose that we have a separable stochastic process $(\mathcal{A},T,\mathcal{B}_t,\mathcal{C}_{s,t},p)$. If $q$ is another probability on $\mathcal{A}$, we can consider $I[p,q]$, the discrimination information in favour of $p$ against $q$ contained in various sub-algebras of $\mathcal{A}$. In particular we will be interested in the information contained in the sub-algebras $\mathcal{B}_t$ and $\mathcal{C}_{0,t}$.


Csiszar has considered discrimination information between two Markov chains with different initial probabilities, and the same transition probabilities [6]. Not surprisingly, $I(\mathcal{B}_t)$ decreases with $t$; in fact, if the chain is recurrent, irreducible and aperiodic, it converges to zero, hence proving ergodicity.
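The decrease of the information in $\mathcal{B}_t$ can be observed on a small chain. The transition matrix and the two initial distributions below are hypothetical values chosen so that the chain is irreducible and aperiodic; the function names are ours.

```python
import math

def discrimination(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def step(dist, transition):
    """Distribution one epoch later under the given transition matrix."""
    size = len(dist)
    return [sum(dist[x] * transition[x][y] for x in range(size)) for y in range(size)]

transition = [[0.9, 0.1, 0.0],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]]     # hypothetical irreducible, aperiodic chain
p, q = [0.8, 0.1, 0.1], [0.1, 0.2, 0.7]

history = []                        # information in B_t for t = 0, 1, 2, ...
for _ in range(60):
    history.append(discrimination(p, q))
    p, q = step(p, transition), step(q, transition)
```

The recorded values decrease from epoch to epoch and become negligible after a few dozen steps, as both distributions approach the same stationary distribution.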


We will take the opposite approach. We will assume that the initial subfield $\mathcal{B}_0$ is given, and will be concerned with the discrimination information between two families of transition probabilities.


First, we will calculate the discrimination information directly, for two simple cases.


2.1 Suppose we have two Poisson processes starting with $X_0 = 0$, with probabilities $p_1$ and $p_2$, and with intensity parameters $\lambda_1$ and $\lambda_2$ respectively. We want to find the discrimination information in favour of $p_1$ against $p_2$ (i.e. in favour of the hypothesis $\lambda = \lambda_1$ against the hypothesis $\lambda = \lambda_2$) in the subfield $\mathcal{C}_{0,t}$.
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For any integer n , let are represent the measurable set 
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and let ne be the O-algebra generated by these sets. The atoms of 


this algebra are 


(2252) 
; : n 
where ie ol) =e No 


Now, because the process has independent increments, we have 


(n) = ae rt Jy 
Peace ame se texD ora) Gare steer Cee sy 
ie n i=l 
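Hence,
$I(p_1,p_2;\mathcal{C}^{(n)}) = \sum_j p_1(C_j^{(n)}) \log \frac{p_1(C_j^{(n)})}{p_2(C_j^{(n)})} = (\lambda_2-\lambda_1)t + \lambda_1 t \log \frac{\lambda_1}{\lambda_2}$    (2.1.4)


Note that $n$ does not figure in (2.1.4), so that using (1.1.1) and (I, 4.2.6) we obtain the same value for the information in $\mathcal{C}_{0,t}$.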
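The closed form in (2.1.4), and the fact that the number of subdivisions does not matter, can be checked numerically. The sketch below sums the Poisson series up to a finite cutoff (an approximation, though the neglected tail is negligible for these means), and uses the additivity of discrimination information over independent increments; the intensities and the function names are illustrative.

```python
import math

def log_pmf(k, mean):
    """Logarithm of the Poisson probability of k events."""
    return -mean + k * math.log(mean) - math.lgamma(k + 1)

def count_information(mean1, mean2, cutoff=200):
    """Information in favour of Poisson(mean1) against Poisson(mean2), series truncated."""
    return sum(math.exp(log_pmf(k, mean1)) * (log_pmf(k, mean1) - log_pmf(k, mean2))
               for k in range(cutoff))

lam1, lam2, t = 3.0, 1.5, 2.0
closed_form = (lam2 - lam1) * t + lam1 * t * math.log(lam1 / lam2)

def information_in_increments(n):
    """Information in n independent increments of length t/n (additivity)."""
    return n * count_information(lam1 * t / n, lam2 * t / n)

values = [information_in_increments(n) for n in (1, 2, 8, 64)]
```

Every entry of `values` agrees with `closed_form`; in particular the count at epoch $t$ alone (the case `n = 1`) already carries all of the information.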
2.2 Next, let us suppose that we have two Brownian motions starting at the origin with differing diffusion rates $a$ and $b$. Again we want the discrimination information in favour of the hypothesis $\sigma^2 = at$ against
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Let ji) represent the algebra 


The discrimination information between two central normal distributions with different variances can readily be calculated as


$I[\sigma_1^2,\sigma_2^2] = \frac{1}{2}\left(\log\frac{\sigma_2^2}{\sigma_1^2} + \frac{\sigma_1^2}{\sigma_2^2} - 1\right)$    (2.2.1)


hence we see that the information depends only on the ratio of the variances. Thus, the information in the sub-field $\mathcal{B}_t$ is equal to


$\frac{1}{2}\left(\log\frac{b}{a} + \frac{a}{b} - 1\right)$    (2.2.2)


and in the algebra introduced above it is equal to (using the property of independent increments)


$\frac{n}{2}\left(\log\frac{b}{a} + \frac{a}{b} - 1\right)$.    (2.2.3)


Letting $n \to \infty$, we see that the information in $\mathcal{C}_{0,t}$ is infinite.
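The contrast with the Poisson case can be made concrete. The sketch below checks the closed form for two central normal distributions by a simple midpoint-rule integration (the integration limits and step count are our choices) and then evaluates the information in $n$ independent increments; the diffusion rates are illustrative values.

```python
import math

def normal_discrimination(var1, var2):
    """Information in favour of N(0, var1) against N(0, var2)."""
    return 0.5 * (math.log(var2 / var1) + var1 / var2 - 1)

def numeric_discrimination(var1, var2, half_width=40.0, steps=20_000):
    """Midpoint-rule check of the closed form by direct integration."""
    def log_density(x, var):
        return -0.5 * math.log(2 * math.pi * var) - x * x / (2 * var)
    dx = 2 * half_width / steps
    total = 0.0
    for k in range(steps):
        x = -half_width + (k + 0.5) * dx
        lp, lq = log_density(x, var1), log_density(x, var2)
        total += math.exp(lp) * (lp - lq) * dx
    return total

a, b, t = 1.0, 2.0, 3.0
at_epoch_t = normal_discrimination(a * t, b * t)

def in_n_increments(n):
    """n independent increments, each N(0, a*t/n) against N(0, b*t/n)."""
    return n * normal_discrimination(a * t / n, b * t / n)
```

Since each increment contributes the same amount regardless of its length, `in_n_increments(n)` grows linearly in `n`, which is why the information in the whole interval is infinite.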


3. A General Formulation. 
3.1 Let us first define some notation.


For any $t$ let $\varphi(t,x)$ be $I(\mathcal{B}_t \mid x)$, the discrimination information in $\mathcal{B}_t$ conditioned by $\mathcal{B}_0$. It is the discrimination information between the transition probabilities with lag $t$, and is for given $t$ a random variable on $\mathcal{B}_0$, i.e. a function on $E$.


By $\psi(t,x)$ we will denote $I(\mathcal{C}_{0,t} \mid x)$, the information in $\mathcal{C}_{0,t}$ conditioned by $\mathcal{B}_0$. It is also a function on $E$. If we know the initial distribution $\tau$, then
$I(\mathcal{C}_{0,t} \mid \mathcal{B}_0) = \int \psi(t,x)\,d\tau(x)$.    (3.1.1)


Obviously $\psi$ is increasing in its first argument, and


$\varphi \le \psi$.    (3.1.2)


An important case is that in which $\psi < \infty$ and equality holds, for then the observation at epoch $t$ is as informative in discriminating the two probabilities as all events occurring beforehand. We noticed this case for the Poisson process.


We will require the following regularity condition on $\varphi$: For every $t$, $\varphi(t,\cdot) \in D$, the domain of the transition operators.


In order to calculate $\varphi$ directly, we must have explicit expressions for the two sets of transition probabilities $P_t$ and $Q_t$. For many concrete processes of the step kind, the parameters are defined in terms of the intensities defined in 1.5, and expressions for the transition probabilities can be very cumbersome. We will derive an operator equation for $\psi$, which does not involve $\varphi$.


We will maintain the following notation: $P_t$ will be the transition probability of the process under study, its associated operator will be $P_t^*$, and $A^*$ the infinitesimal generator. If the intensities of 1.5 exist, they will be denoted by $A_0$ and $A_1$, with $A = A_0 + A_1$. For the process against which we are discriminating, we will use the respective symbols $Q_t$, $Q_t^*$, $B^*$, $B_0$, $B_1$, $B = B_0 + B_1$.
3.2 Let us first consider the case where our temporal space consists of the non-negative integers.
Theorem 1:
$\psi(t,x) = \sum_{k=0}^{t-1} P_k^*\,\psi(1,x)$    (3.2.1)
Proof: In this case $\varphi(1,x) = \psi(1,x)$. Then
$\psi(t,x) = I\left(\bigvee_{k=1}^{t} \mathcal{B}_k \,\Big|\, x\right) = I\left(\bigvee_{k=1}^{t-1} \mathcal{B}_k \,\Big|\, x\right) + I\left(\mathcal{B}_t \,\Big|\, \bigvee_{k=0}^{t-1} \mathcal{B}_k\right)(x)$.    (3.2.2)
By the Markov property, we have
$I\left(\mathcal{B}_t \,\Big|\, \bigvee_{k=0}^{t-1} \mathcal{B}_k\right)(x) = P_{t-1}^*\,\psi(1,x)$    (3.2.3)
so that
$\psi(t,x) = \psi(t-1,x) + P_{t-1}^*\,\psi(1,x)$.    (3.2.4)
(3.2.1) follows by induction.


Corollary 1: If $\psi(1,\cdot)$ is constant, then we have simply (since the $P_k^*$ are transition operators)


$\psi(t,x) = t\,\psi(1,x)$.    (3.2.5)
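Theorem 1 can be verified by brute force on a small state space. The sketch below takes two hypothetical two-state chains, computes the information in the first $t$ epochs directly by enumerating every path, and compares it with the sum of the one-step information propagated by the transition operator of the chain under study. All names and numbers are ours.

```python
import math
from itertools import product

P = [[0.7, 0.3], [0.4, 0.6]]     # hypothetical chain under study
Q = [[0.5, 0.5], [0.2, 0.8]]     # hypothetical chain discriminated against

def one_step_information(x):
    """psi(1, x): discrimination information between the rows P(x, .) and Q(x, .)."""
    return sum(P[x][y] * math.log(P[x][y] / Q[x][y]) for y in range(2))

def path_information(t, x):
    """psi(t, x) computed directly by enumerating every path of length t from x."""
    total = 0.0
    for path in product(range(2), repeat=t):
        p_prob = q_prob = 1.0
        state = x
        for nxt in path:
            p_prob *= P[state][nxt]
            q_prob *= Q[state][nxt]
            state = nxt
        total += p_prob * math.log(p_prob / q_prob)
    return total

def theorem_sum(t, x):
    """Right side of (3.2.1): sum over k < t of (P_k^* psi(1, .))(x)."""
    values = [one_step_information(y) for y in range(2)]   # P_0^* psi(1, .)
    total = 0.0
    for _ in range(t):
        total += values[x]
        # apply the one-step operator once more: values becomes P_{k+1}^* psi(1, .)
        values = [sum(P[i][j] * values[j] for j in range(2)) for i in range(2)]
    return total
```

For example, `path_information(4, 0)` and `theorem_sum(4, 0)` agree to rounding error, although the first enumerates sixteen paths and the second only applies the operator three times.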
3.3 Now let us consider the continuous time case. Denote by $\mathcal{C}_n$ the algebra $\bigvee_{k=1}^{2^n} \mathcal{B}_{kt/2^n}$. The process restricted to this algebra can be considered a discrete-time process, with transition operator $P^*_{t/2^n}$. Hence we have, from (3.2.4),
$I(\mathcal{C}_n \mid x) = \sum_{k=0}^{2^n-1} P^*_{kt/2^n}\,\varphi\!\left(\frac{t}{2^n},x\right)$.    (3.3.1)
We have that $\bigvee_{n=1}^{\infty} \mathcal{C}_n = \mathcal{C}_{0,t}$ by our separability assumption, and hence
$\lim_{n} I(\mathcal{C}_n \mid x) = \psi(t,x)$.    (3.3.2)
With $S_{n,t} = \frac{t}{2^n}\sum_{k=0}^{2^n-1} P^*_{kt/2^n}$, (3.3.1) becomes


$I(\mathcal{C}_n \mid x) = S_{n,t}\,\frac{2^n}{t}\,\varphi\!\left(\frac{t}{2^n},x\right)$.    (3.3.3)




3.4 Before we proceed, let us define a concept of infinitesimal information, which we will denote by $L$, by


$L(x) = \lim_{t \downarrow 0} \frac{\varphi(t,x)}{t}$    (3.4.1)


whenever the limit exists.


Now, the operator $S_{n,t}$ converges (since the semi-group is continuous) to


$\int_0^t P_s^*\,ds$.


Hence, as all the operators are continuous, we find that
$\psi(t,x) = \int_0^t P_s^* L(x)\,ds$
provided that $L$ is defined as a uniform limit.


We enunciate the result as


Theorem 2: If $L$, the infinitesimal information, is defined as a uniform limit, then
$\psi(t,x) = \int_0^t P_s^* L(x)\,ds$.    (3.4.3)
3.5 Suppose that we have that the function $\varphi(t,\cdot)$ is constant for every $t$. This will be true if the following condition is satisfied:


For every $t \in T$ and every $x,x' \in E$, there exists a (measurable) permutation $\pi$ of $E$ such that, for every $F \in \mathcal{E}$,


$P_t(x,\pi F) = P_t(x',F)$, $Q_t(x,\pi F) = Q_t(x',F)$.    (3.5.1)


The aforementioned consequence of this condition is easily verified from the fact that a non-singular transformation leaves the information invariant [29, Chapter 2, Cor. 4.1].


Condition (3.5.1) is satisfied in particular if the state space has a group structure, and the process has independent increments. In this case (3.4.3) takes on a very simple form:


Corollary: If (3.5.1) is satisfied then
$\psi(t,x) = Lt$.    (3.5.2)
3.6 We now have a theorem which ensures the existence of the infinitesimal information.
the infinitesimal information $L(x) = \lim_{t \downarrow 0} \frac{\varphi(t,x)}{t}$ exists and is equal to
$A(x,\{x\}) - B(x,\{x\}) + I[A_1(x,\cdot),B_1(x,\cdot)]$


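Proof: We have that


$\varphi(t,x) = \sup_{\mathcal{A}\ \mathrm{finite}} \sum P_t(x,F) \log \frac{P_t(x,F)}{Q_t(x,F)}$,


the summation extending over all atoms $F$ of $\mathcal{A}$.


Because of the separability of $E$, we have in fact a sequence $\mathcal{A}_n \uparrow \mathcal{E}$, such that


$\varphi(t,x) = \lim_{n \to \infty} I(P_t(x,\cdot),Q_t(x,\cdot);\mathcal{A}_n)$.


Without loss of generality, we may assume that $\{x\} \in \mathcal{A}_n$ for all $n$.


Then we have


$\varphi(t,x) = \lim_{n \to \infty} I(P_t(x,\cdot),Q_t(x,\cdot);\{x\}^c \cap \mathcal{A}_n) + I(P_t(x,\cdot),Q_t(x,\cdot);\{x\}) = \varphi_1(t,x) + \varphi_0(t,x)$.    (3.6.1)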
$I(A_1(x,\cdot),B_1(x,\cdot);\mathcal{A}_n)$    (3.6.2)


and because the convergence is uniform in the elements of $\{x\}^c \cap \mathcal{E}$ we may interchange limits, so that


$\lim_{t \downarrow 0} \frac{1}{t}\varphi_1(t,x) = \lim_{n \to \infty} I(A_1(x,\cdot),B_1(x,\cdot);\mathcal{A}_n) = I[A_1(x,\cdot),B_1(x,\cdot)]$.    (3.6.3)


It follows readily from l'Hospital's rule that


$\lim_{t \downarrow 0} \frac{1}{t}\varphi_0(t,x) = A(x,\{x\}) - B(x,\{x\})$.    (3.6.4)


Hence our theorem is proved.
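The expression for the infinitesimal information can be checked numerically on a finite state space, where the jump intensities form a matrix with rows summing to zero. The sketch below uses two hypothetical three-state generators, obtains the transition probabilities from a power-series matrix exponential (adequate here because the generators are bounded), and compares the information at a small lag, divided by the lag, with the expression above. All names and numbers are ours.

```python
import math

def mat_exp(generator, t, terms=30):
    """exp(t*generator) by its power series."""
    size = len(generator)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    term = [row[:] for row in result]
    for n in range(1, terms):
        term = [[sum(term[i][k] * generator[k][j] * t / n for k in range(size))
                 for j in range(size)] for i in range(size)]
        result = [[result[i][j] + term[i][j] for j in range(size)] for i in range(size)]
    return result

A = [[-3.0, 2.0, 1.0], [1.0, -1.5, 0.5], [2.0, 2.0, -4.0]]   # hypothetical intensities
B = [[-2.0, 1.0, 1.0], [0.5, -2.5, 2.0], [1.0, 3.0, -4.0]]   # hypothetical intensities

def lag_information(t, x):
    """phi(t, x): information between the transition probabilities with lag t."""
    P, Q = mat_exp(A, t), mat_exp(B, t)
    return sum(P[x][y] * math.log(P[x][y] / Q[x][y]) for y in range(3))

def infinitesimal_information(x):
    """A(x,{x}) - B(x,{x}) + I[A_1(x,.), B_1(x,.)] for a finite state space."""
    jumps = sum(A[x][y] * math.log(A[x][y] / B[x][y]) for y in range(3) if y != x)
    return A[x][x] - B[x][x] + jumps

t = 1e-6
ratios = [lag_information(t, x) / t for x in range(3)]
limits = [infinitesimal_information(x) for x in range(3)]
```

For each starting state the ratio at the small lag is close to the value of the expression, with a discrepancy of the order of the lag itself.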


3.7 Equation (3.4.3), though interesting, is not very useful, because its right side could be rather cumbersome to evaluate. We therefore present another theorem.


Theorem 4: If (3.4.1) holds uniformly in $x$ and if $L \in C$, the domain of $A^*$, then


$\frac{\partial\psi}{\partial t} = A^*\psi + L$.    (3.7.1)


Proof: From (3.4.3) and (1.4.8) we have
$A^*\psi = (P_t^* - I)L$.    (3.7.2)
From (3.4.3) we also see that
$\frac{\partial\psi}{\partial t} = P_t^* L$.    (3.7.3)


Combining (3.7.2) and (3.7.3) we obtain (3.7.1).
APPENDIX 1


A convergence theorem: 


Let $(\Omega,F,P)$ be a probability space.
Let $\{\mathcal{A}_n\}$ be a family of $\sigma$-algebras increasing to $\mathcal{A} \subset F$.
Let $f$ be a function convex on $[0,\infty)$, and let $p$ be a non-negative integrable function.
Let $\varphi_n = E(p|\mathcal{A}_n)$ and $\varphi = E(p|\mathcal{A})$. Further, let $i_n = f \circ \varphi_n$, $i = f \circ \varphi$.
Then $i_n \to i$ a.s. and $E(i_n) \uparrow E(i)$.
Proof: $\{\varphi_n;\ n=1,\ldots,\infty\}$ is a martingale, and hence $\varphi_n \to \varphi$ a.s. [28]. As convexity implies continuity, $i_n \to i$ a.s., and $\{i_n\}$ forms a sub-martingale. Hence $n < m \Rightarrow$
$E(i_n) \le E(i_m) \le E(i)$.


Thus $E(i_n)$ is increasing and bounded above, and converges to some number $\le E(i)$. Now $E(i) = E(i^+) - E(i^-)$.


By Fatou's lemma,


$\lim E(i_n^+) \ge E(i^+)$
so it only remains to prove that $E(i_n^-) \to E(i^-)$.
Let $x_0$ be the largest number such that $f(x_0) = 0$. If there is no such number, then $f > 0$, $i_n^- = 0$ and the theorem is proved. Otherwise, let $\alpha x + \beta$ be the line of support of $f$ at $x_0$, i.e.,


$\alpha x_0 + \beta = 0$ and $f(x) \ge \alpha x + \beta$.


Case 1: Suppose $\alpha > 0$. Let
$g(x) = -\alpha x - \beta$ for $x \le x_0$,
$g(x) = 0$ for $x \ge x_0$.
Then $g$ is convex, bounded and integrable, and
$i_n^- \le g \circ \varphi_n \le -\beta$,
hence by bounded convergence $E(i_n^-) \to E(i^-)$.
Case 2: Suppose $\alpha \le 0$. Let
$g(x) = -\alpha x - \beta$ for $x \ge x_0$,
$g(x) = 0$ for $x \le x_0$.
Then $f^- \le g$, so that $i_n^- \le g \circ \varphi_n$.
Hence, by dominated convergence $E(i_n^-) \to E(i^-)$.
APPENDIX 2


A semi-group theorem: 


Let $(E,\mathcal{E})$ be a measurable space.
Let $\{P_t\}$ be a semi-group of Markovian kernels on $(E,\mathcal{E})$ and let $P_t^*$ be the associated linear operators.
Suppose that


$A_0(x,\{x\}) = \lim_{t \downarrow 0} \frac{P_t(x,\{x\}) - 1}{t}$
and
$A_1(x,F) = \lim_{t \downarrow 0} \frac{P_t(x,F)}{t}$
exist uniformly in $x$ and $F \in \{x\}^c \cap \mathcal{E}$. Then the limit $\frac{P_t^* - I}{t}$ exists uniformly as $t \downarrow 0$ and $A = A_0 + A_1$ is the kernel of this infinitesimal operator.


Proof: 
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and as we are taking the sup over functions for which lee ohh 
eta 1. Thus we can make the first quantity < a: The second 
quantity in | | is less than or equal to 
because the convergence is uniform in both $x$ and $F$. Thus the theorem is proved.